{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Compute the completeness of CPT and DA models\n",
    "\n",
    "* This notebook demonstrates how we can estimate the completeness of two economic models, *Cumulative Prospect Theory (CPT)* and *Dissapointment Aversion (DA)* by our code\n",
    "* Use the dataset of Bruhin, Fehr-Duda, and Epper (2010, ECMA), \"Risk and Rationality: Uncovering Heterogeneity in Probability Distortion\"\n",
    "  - the prediction of certainty equivalents for a set of 25 binary lotteries\n",
    "\n",
    "**Preparation**\n",
    "* Locate the python script `completeness.py` at the same directory as this notebook\n",
    "* Also put the data file in a suitable folder\n",
    "  - In this notebook, the data file is located at `../data/Bruhin_et_al_2010.mat`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from completeness import *\n",
    "from sklearn.model_selection import KFold\n",
    "\n",
    "## mat file\n",
    "from scipy.io import loadmat"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Convert .mat file to pandas dataframe if necessary\n",
    "\n",
    "**Reference**\n",
    "* https://www.askpython.com/python/examples/mat-files-in-python"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ce</th>\n",
       "      <th>female</th>\n",
       "      <th>highincome</th>\n",
       "      <th>id</th>\n",
       "      <th>lottery</th>\n",
       "      <th>p1</th>\n",
       "      <th>semester</th>\n",
       "      <th>z1</th>\n",
       "      <th>z2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>25.25</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>10</td>\n",
       "      <td>0.25</td>\n",
       "      <td>4</td>\n",
       "      <td>50</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>17.50</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>8</td>\n",
       "      <td>0.95</td>\n",
       "      <td>4</td>\n",
       "      <td>20</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1.75</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0.10</td>\n",
       "      <td>4</td>\n",
       "      <td>10</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>23.75</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>0.90</td>\n",
       "      <td>4</td>\n",
       "      <td>50</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>13.25</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>0.50</td>\n",
       "      <td>4</td>\n",
       "      <td>20</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      ce  female  highincome  id  lottery    p1  semester  z1  z2\n",
       "0  25.25       1           0   1       10  0.25         4  50  20\n",
       "1  17.50       1           0   1        8  0.95         4  20   0\n",
       "3   1.75       1           0   1        1  0.10         4  10   0\n",
       "4  23.75       1           0   1        2  0.90         4  50   0\n",
       "6  13.25       1           0   1        4  0.50         4  20  10"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# load .mat file\n",
    "file_name = '../data/Bruhin_et_al_2010.mat'\n",
    "matdata = loadmat(file_name)\n",
    "\n",
    "# convert it to a pandas dataframe\n",
    "df_full = pd.DataFrame()\n",
    "for key in matdata.keys():\n",
    "    if not '__' in key:\n",
    "        df_full[key] = matdata[key].flatten()\n",
    "\n",
    "# drop lotteries over losses\n",
    "df = df_full[df_full['z2'] >= 0]\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Requirements**\n",
    "* The dataframe must have the columns with the following name:\n",
    "  - `ce`, `z1`, `z2`, `p1`, `id`, `lottery`\n",
    "  - the order of the columns does not matter\n",
    "\n",
    "**Data description**\n",
    "* each lottery is represented as a triple $(p_1, z_1, z_2)$.\n",
    "  - $z_1 > z_2$. $p_1$ is the probability of $z_1$'s realizing\n",
    "* `id`: ids for the subjects (179 subjects in total)\n",
    "* `lottery`: ids for the lotteries (25 lotteries in total)\n",
    "* `ce`: reported certainty equivalents by the subjects"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "179 187\n",
      "25 49\n"
     ]
    }
   ],
   "source": [
    "# NB: some ids are missing\n",
    "id_list = df['id'].unique()\n",
    "lottery_id_list = df['lottery'].unique()\n",
    "\n",
    "print(id_list.shape[0], id_list.max())\n",
    "print(lottery_id_list.shape[0], lottery_id_list.max())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Compute the completeness via cross validation\n",
    "* First, we show how to compute the completeness for CPT model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Compute the completeness $\\kappa$ via $K$-fold cross validation\n",
    "  - $K := 10$\n",
    "* Split the set of **subject ids** into ten fold\n",
    "  - The way of splitting the subjects should be fixed when comparing different models\n",
    "  - splits are generated and save them to `fold` directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# split the subjects and save\n",
    "subject_idx = df['id'].values\n",
    "idx_clustered = np.sort(np.unique(subject_idx)) # the set of subject indices (sorted)\n",
    "num_subject = idx_clustered.shape[0] # this corresponds to N in the paper\n",
    "\n",
    "np.random.seed(20220306)\n",
    "n_splits = 10\n",
    "kf = KFold(n_splits, shuffle=True)\n",
    "ind = 0\n",
    "for train_idx_clustered, test_idx_clustered in kf.split(idx_clustered):\n",
    "    # NB: kf.split returns indices for the array, not the subject indices\n",
    "    # train_subject_idx contains the subject ids for those who are included in the training data\n",
    "    train_subject_idx = idx_clustered[train_idx_clustered]\n",
    "    test_subject_idx = idx_clustered[test_idx_clustered]\n",
    "    np.savetxt('fold/train_subject_idx_{}.csv'.format(ind), train_subject_idx, fmt='%d')\n",
    "    np.savetxt('fold/test_subject_idx_{}.csv'.format(ind), test_subject_idx, fmt='%d')\n",
    "    \n",
    "    train_idx = np.isin(subject_idx, train_subject_idx)\n",
    "    test_idx = np.isin(subject_idx, test_subject_idx)\n",
    "    np.savetxt('fold/train_idx_{}.csv'.format(ind), train_idx, fmt='%d')\n",
    "    np.savetxt('fold/test_idx_{}.csv'.format(ind), test_idx, fmt='%d')\n",
    "    ind += 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* The CPT model is trained via grid search\n",
    "  - Need to specify the search space for the parameters\n",
    "  - the grid for each parameter is generated first, and then combine them via `cartesian` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# choose param grids\n",
    "## large grid size is chosen here for test\n",
    "grid_size_alpha = 0.4\n",
    "grid_size_gamma = 0.4\n",
    "grid_size_delta = 1.0\n",
    "alpha_grid = np.arange(grid_size_alpha, 1+grid_size_alpha, grid_size_alpha)\n",
    "gamma_grid = np.arange(grid_size_gamma, 1+grid_size_gamma, grid_size_gamma)\n",
    "delta_grid = np.arange(grid_size_delta, 5+grid_size_delta, grid_size_delta)\n",
    "\n",
    "param_grid = cartesian_product(alpha_grid, gamma_grid, delta_grid)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* To compute the completeness, we first need to pass the dataset `df` (pandas dataframe) to function `preprocess`.\n",
    "* Then the outputs and `param_grid` to function `cross_validation`\n",
    "* See `completeness.py` for the details of implementation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "X, Y, subject_idx, cov_idx, cov_list = preprocess(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "num_subject: 179\n",
      "completeness: 0.9885216015204927\n",
      "stderr (analytical): 0.08609388093134117\n",
      "model_best_param: [0.8 0.4 1. ]\n"
     ]
    }
   ],
   "source": [
    "completeness, stderr, model_best_params = cross_validation(X, Y, subject_idx, cov_idx, cov_list, param_grid)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Compute SE via Bootstrap"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Compute the standard error of the completeness using a block bootstrapping procedure\n",
    "  - Clusters together all observations from the same subjects\n",
    "* The SE reported in the paper is computed using 1000 bootstrap samples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "stderr (bootstrapped): 0.02525558440925165\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0.02525558440925165"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Here, bootstrap sample size is set to be small for a test (also with coarse grid)\n",
    "se_bootstrap = bootstrap_se(X, Y, subject_idx, cov_idx, cov_list, param_grid, bs_sample_size=10)\n",
    "se_bootstrap"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. DA model\n",
    "* We can compute the completeness for DA model in almost the same manner\n",
    "* You should set the `base_param`, `pred` and `train_model` options in `cross_validation` function\n",
    "    - default: `base_param=[1,1,1]`, `pred=pred_CPT, train_model=train_CPT`\n",
    "    - when you want to use DA model: `base_param=[1,0]`, `pred=pred_DA, train_model=train_DA`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# large grid size for test\n",
    "alpha_grid_size = 0.1\n",
    "eta_grid_size = 0.1\n",
    "alpha_ubd = 1\n",
    "eta_lbd = -1\n",
    "eta_ubd = 5\n",
    "alpha_grid = np.arange(alpha_grid_size, alpha_ubd+alpha_grid_size, alpha_grid_size)\n",
    "eta_grid = np.arange(eta_lbd, eta_ubd+eta_grid_size, eta_grid_size)\n",
    "\n",
    "param_grid = cartesian_product(alpha_grid, eta_grid)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "num_subject: 179\n",
      "completeness: 0.2626652570580822\n",
      "stderr (analytical): 0.056904795213523214\n",
      "model_best_param: [1.   0.48]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "(0.2626652570580822, 0.056904795213523214, array([1.  , 0.48]))"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cross_validation(X, Y, subject_idx, cov_idx, cov_list, param_grid,\n",
    "                 pred=pred_DA, train_model=train_DA,\n",
    "                 base_param=np.array([1,0]), n_splits=10, print_option=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "stderr (bootstrapped): 0.029125888127570106\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0.029125888127570106"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Here, bootstrap sample size is set to be small for a test (also with coarse grid)\n",
    "se_bootstrap = bootstrap_se(X, Y, subject_idx, cov_idx, cov_list, param_grid, base_param=np.array([1,0]), pred=pred_DA, train_model=train_DA, bs_sample_size=10)\n",
    "se_bootstrap"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "7c6a6b8b72bec2727d3e7ff0edf122757d065e08cb6cfd6b86ed0a9c4ddb09ca"
  },
  "kernelspec": {
   "display_name": "Python 3.10.2 64-bit ('.venv': pipenv)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
